260 research outputs found

The Degrees of Freedom of Partial Least Squares Regression

Author: Akaike H.
Brown P.
Krämer N.
Lanczos C.
Leisch F.
Masashi Sugiyama
Nicole Krämer
Publication date: 01/01/2010

The derivation of statistical properties for Partial Least Squares regression can be a challenging task. The reason is that the construction of latent components from the predictor variables also depends on the response variable. While this typically leads to good performance and interpretable models in practice, it makes the statistical analysis more involved. In this work, we study the intrinsic complexity of Partial Least Squares Regression. Our contribution is an unbiased estimate of its Degrees of Freedom. It is defined as the trace of the first derivative of the fitted values, seen as a function of the response. We establish two equivalent representations that rely on the close connection of Partial Least Squares to matrix decompositions and Krylov subspace techniques. We show that the Degrees of Freedom depend on the collinearity of the predictor variables: the lower the collinearity is, the higher the Degrees of Freedom are. In particular, they are typically higher than the naive approach that defines the Degrees of Freedom as the number of components. Further, we illustrate how the Degrees of Freedom approach can be used for the comparison of different regression methods. In the experimental section, we show that our Degrees of Freedom estimate in combination with information criteria is useful for model selection. Comment: to appear in the Journal of the American Statistical Association.
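The definition quoted in the abstract above (Degrees of Freedom as the trace of the derivative of the fitted values with respect to the response) can be illustrated numerically. The following is a minimal sketch, not the authors' unbiased analytic estimate: it approximates the trace by central finite differences, and it computes PLS fitted values through the Krylov-subspace characterisation without centering or an intercept. The function names `pls_fitted_values` and `dof_finite_difference` are hypothetical and chosen for illustration only.

```python
import numpy as np


def pls_fitted_values(X, y, m):
    """Fitted values of PLS with m components (assumption: no centering).

    Uses the Krylov-subspace view: the coefficient vector minimises the
    least-squares criterion over span{s, A s, ..., A^(m-1) s},
    with A = X'X and s = X'y.
    """
    A = X.T @ X
    s = X.T @ y
    cols = [s]
    for _ in range(m - 1):
        cols.append(A @ cols[-1])
    K = np.column_stack(cols)
    # Orthonormalise the Krylov basis; the span is unchanged.
    Q, _ = np.linalg.qr(K)
    beta = Q @ np.linalg.solve(Q.T @ A @ Q, Q.T @ s)
    return X @ beta


def dof_finite_difference(fit, y, eps=1e-6):
    """Approximate trace(d fit(y) / d y) by central differences."""
    n = y.size
    trace = 0.0
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        trace += (fit(y + step)[i] - fit(y - step)[i]) / (2.0 * eps)
    return trace
```

As a sanity check under these assumptions, using as many components as predictors makes PLS coincide with ordinary least squares, so the approximated trace should be close to the number of predictors. With fewer components the fit is nonlinear in the response, which is why the trace need not equal the number of components.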

arXiv.org e-Print Archive

CiteSeerX

Crossref

Publications Server of the Weierstrass Institute for Applied Analysis and Stochastics

Repositorium für Naturwissenschaften und Technik

Research Papers in Economics

Modelling time course gene expression data with finite mixtures of linear additive models

Author: Androulakis
B. Grun
F. Leisch
Luan
Maugis
Spellman
T. Scharl
Publication venue: Oxford University Press
Publication date: 01/01/2011

Summary: A model class of finite mixtures of linear additive models is presented. The component-specific parameters in the regression models are estimated using regularized likelihood methods. The advantages of the regularization are that (i) the pre-specified maximum degrees of freedom for the splines is less crucial than for unregularized estimation and that (ii) a suitable degree of freedom is selected automatically for each component individually. The performance is evaluated in a simulation study with artificial data as well as on a yeast cell cycle dataset of gene expression levels over time.

Crossref

PubMed Central

Research Online

Morphology of obligate ectosymbionts reveals Paralaxus gen. nov.: A new circumtropical genus of marine stilbonematine nematodes

Author: Gruber-Vodicka H.
Leisch N.
Ott J.
Scharhauser F.
Zimmermann J.
Publication date: 01/05/2020

Stilbonematinae are a subfamily of conspicuous marine nematodes, distinguished by a coat of sulphur‐oxidizing bacterial ectosymbionts on their cuticle. Like most nematodes, the worm hosts have a relatively simple anatomy and few taxonomically informative characters, and this has resulted in numerous taxonomic reassignments and synonymizations. Recent studies using a combination of morphological and molecular traits have helped to improve the taxonomy of Stilbonematinae but also raised questions on the validity of several genera. Here, we describe a new circumtropically distributed genus Paralaxus (Stilbonematinae) with three species: Paralaxus cocos sp. nov., P. bermudensis sp. nov. and P. columbae sp. nov. We used single worm metagenomes to generate host 18S rRNA and cytochrome c oxidase I (COI) as well as symbiont 16S rRNA gene sequences. Intriguingly, COI alignments and primer matching analyses suggest that the COI is not suitable for PCR‐based barcoding approaches in Stilbonematinae as the genera have a highly diverse base composition and no conserved primer sites. The phylogenetic analyses of all three gene sets, however, confirm the morphological assignments and support the erection of the new genus Paralaxus as well as corroborate the status of the other stilbonematine genera. Paralaxus most closely resembles the stilbonematine genus Laxus in overlapping sets of diagnostic features but can be distinguished from Laxus by the morphology of the genus‐specific symbiont coat. Our re‐analyses of key parameters of the symbiont coat morphology as a character for all Stilbonematinae genera show that with amended descriptions, including the coat, highly reliable genus assignments can be obtained.

MPG.PuRe

New consistency index based on inertial operating speed

Author: Cafiso S.
Camacho-Torregrosa F. J.
Gibreel G. M.
Hassan Y.
Kanellaidis G.
Lamm R.
Leisch J. E.
Ng J. C. W.
Polus A.
Treat J. R.
Publication venue: Transportation Research Board
Publication date: 01/01/2013

The occurrence of road crashes depends on several factors, with design consistency (i.e., conformance of highway geometry to drivers' expectations) being one of the most important. A new consistency model for evaluating the performance of tangent-to-curve transitions on two-lane rural roads was developed. This model was based on the inertial consistency index (ICI) defined for each transition. The ICI was calculated at the beginning point of the curve as the difference between the average operating speed on the previous 1-km road segment (inertial operating speed) and the actual operating speed at this point. For the calibration of the ICI and its thresholds, 88 road segments, which included 1,686 tangent-to-curve transitions, were studied. The relationship between those results and the crash rate associated with each transition was analyzed. The results showed that the higher the ICI was, the higher the crash rate; thus, the probability of accidents increased. Similar results were obtained from the study of the relationship between the ICI and the weighted average crash rate of the corresponding group of transitions. A graphical and statistical analysis established that road consistency might be considered good when the ICI was lower than 10 km/h, poor when the ICI was higher than 20 km/h, and fair otherwise. A validation process that considered 20 road segments was performed. The ICI values obtained were highly correlated to the number of crashes that had occurred at the analyzed transitions. Thus, the ICI and its consistency thresholds resulted in a new approach for evaluation of consistency.

The authors thank the Center for Studies and Experimentation of Public Works of the Spanish Ministry of Public Works, which partially subsidized the data collection, for obtaining the empirical operating speed profiles used in the validation process. The authors also thank the General Directorate of Public Works of the Infrastructure and Transportation Department of the Valencian government, the Valencian Province Council, and the General Directorate of Traffic of the Ministry of the Interior of the Government of Spain for their cooperation in data gathering.

García García, A.; Llopis Castelló, D.; Camacho Torregrosa, FJ.; Pérez Zuriaga, AM. (2013). New consistency index based on inertial operating speed. Transportation Research Record. (2391):105-112. doi:10.3141/2391-10
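The ICI calculation and the consistency thresholds stated in the abstract of this record can be sketched as follows. This is an illustrative sketch only: the function names, the representation of the speed profile as (station in metres, speed in km/h) samples, and the use of a plain unweighted mean over the previous 1 km are assumptions, not details confirmed by the source.

```python
def inertial_operating_speed(profile, curve_start_m, window_m=1000.0):
    """Mean operating speed (km/h) over the window before the curve start.

    profile: iterable of (station_m, speed_kmh) samples (hypothetical format).
    Assumption: an unweighted mean of the samples inside the window,
    including a sample located exactly at the curve start.
    """
    speeds = [v for s, v in profile
              if curve_start_m - window_m <= s <= curve_start_m]
    if not speeds:
        raise ValueError("no speed samples in the window before the curve")
    return sum(speeds) / len(speeds)


def inertial_consistency_index(inertial_speed_kmh, operating_speed_kmh):
    """ICI: inertial operating speed minus actual operating speed (km/h)."""
    return inertial_speed_kmh - operating_speed_kmh


def classify_consistency(ici_kmh):
    """Thresholds from the abstract: good below 10, poor above 20, else fair."""
    if ici_kmh < 10.0:
        return "good"
    if ici_kmh > 20.0:
        return "poor"
    return "fair"
```

For example, a driver averaging 100 km/h over the preceding kilometre who enters the curve at 85 km/h would have an ICI of 15 km/h, which falls in the "fair" band under the stated thresholds.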

Crossref

RiuNet

Combining an Additive and Tree-Based Regression Model Simultaneously: STIMA

Author: Bart Jan Van Os
Claudio Conversano
Elise Dusseldorp
Leisch F.
Loh W. Y.
Quinlan J. R.
Publication venue: Informa UK Limited

Crossref

Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model

Author: A Dempster
AT Weeraratna
D Porter
DA Porter
F Leisch
F van Ruissen
G Schwarz
GJ McLachlan
H Akaike
H Matsumura
HH Thygesen
J Lu
K Boon
KA Baggerly
M Cornelissen
R Development Core Team
R Edgar
RZ Vencio
S Lee
S Saha
Scott D Zuyderduyn
VA Kuznetsov
VE Velculescu
WN Venables
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated. Results: The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. In further support of the mixture model, there is observed: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted. Conclusion: The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods

Author: A Hess
AA Fodor
AR Dabney
C Li
DB Allison
DP Gaile
E Hubbell
E Schuster
F Leisch
G Smyth
L Shi
LM Cope
P Baldi
RA Irizarry
RC Gentleman
Richard D Pearson
S Hochreiter
S Lemieux
SE Choe
T Sing
VG Tusher
X Liu
Z Chen
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws. Results: We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison. Conclusion: We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evaluation strategies for isotope ratio measurements of single particles by LA-MC-ICPMS

Author: A Axelsson
AP Dempster
B Grün
B. Hattendorf
D. Günther
D. Koffler
DA Skoog
DL Donohue
E Krupp
EM Krupp
F Leisch
F Pointurier
F. Leisch
G. Laaha
I Günther-Leopold
J Fietzke
JA Rodríguez-Castrillón
JM Cottle
K Mayer
L. Dorta
M Dzurko
MB Fricker
NS Lloyd
P Galler
P Rodríguez-González
Q Xie
RD Evans
S Bürger
S Kappel
S Wehmeier
S. F. Boulyga
S. Kappel
SF Boulyga
T Hirata
T Pettke
T. Prohaska
VN Epov
Y Aregbe
Z Varga
Publication venue: Springer Science and Business Media LLC

Crossref

Visualization of proteomics data using R and Bioconductor

Author: Adler D.
Balcome S.
Bemis K. D.
Carey V.
Carlson M.
Chang W.
Chen H.
Claesen J.
Csardi G.
Futschik M.
Galili T.
Gatto L.
Gentleman R.
Gregori J.
Hahne F.
Hansen K. D.
Husson F.
Leisch F.
Li X.
Murrell P.
Naake T.
Nyakas A.
Panse C.
Pedersen T. L.
Ploner A.
R Core Team 2014 R: A Language and Environment for Statistical Computing
Sauteraud R.
Shannon P.
Tarca A. L.
Turewicz M.
Warnes G. R.
Wei T.
Wen B.
Wilkinson L.
Xie Y.
Publication venue: Proteomics
Publication date: 01/04/2015

Data visualization plays a key role in high-throughput biology. It is an essential tool for data exploration, shedding light on data structure and patterns of interest. Visualization is also of paramount importance as a form of communicating data to a broad audience. Here, we provide a short overview of the application of the R software to the visualization of proteomics data. We present a summary of R's plotting systems and how they are used to visualize and understand raw and processed MS-based proteomics data.

LG was supported by the European Union 7th Framework Program (PRIME-XS project, grant agreement number 262067) and a BBSRC Strategic Longer and Larger grant (Award BB/L002817/1). LMB was supported by a BBSRC Tools and Resources Development Fund (Award BB/K00137X/1). TN was supported by an ERASMUS Placement scholarship.

This is the final published version of the article. It was originally published in Proteomics (PROTEOMICS Special Issue: Proteomics Data Visualisation Volume 15, Issue 8, pages 1375–1389, April 2015. DOI: 10.1002/pmic.201400392). The final version is available at http://onlinelibrary.wiley.com/doi/10.1002/pmic.201400392/abstract

Crossref

PubMed Central

Apollo (Cambridge)

Accounting for uncertainty when assessing association between copy number and disease: a latent class model

Author: A Rovelet-Lecrux
Alejandro Cáceres
BE Stranger
C Fraley
C Le Marechal
D Spiegelman
DP Locke
E Gonzalez
F Leisch
F Picard
Geòrgia Escaramís
I Ionita-Laza
Isaac Subirana
J Du
J González
JP Schouten
Juan R González
K Fellermann
KK Wong
L Feuk
Lluís Armengol
MAvan de Wiel
Mvan de Wiel
O Davidov
R Redon
RM Neve
S Bashir
S Engert
S Greenland
S Sarkar
Solymar Peraza
T Aitman
T Hansen
WN van Wieringen
Xavier Estivill
Y Benjamini
Publication venue: BioMed Central
Publication date: 01/06/2009

Background: Copy number variations (CNVs) may play an important role in disease risk by altering dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypic consequences. Therefore, determining whether or not a CNV is associated with a given disease might be relevant in understanding the genesis and progression of human diseases. Current technology gives CNV probe signals from which copy number status is inferred. Incorporating uncertainty of CNV calling in the statistical analysis is therefore a highly important aspect. In this paper, we present a framework for assessing association between CNVs and disease in case-control studies where uncertainty is taken into account. We also indicate how to use the model to analyze continuous traits and adjust for confounding covariates. Results: Through simulation studies, we show that our method outperforms other simple methods based on inferring the underlying CNV and assessing association using regular tests that do not propagate call uncertainty. We apply the method to a real data set in a controlled MLPA experiment showing good results. The methodology is also extended to illustrate how to analyze aCGH data. Conclusion: We demonstrate that our method is robust and achieves maximal theoretical power since it accommodates uncertainty when copy number status is inferred. We have made R functions freely available.

Crossref

Directory of Open Access Journals

PubMed Central